knitr::include_graphics("C:/Users/hp/Downloads/data viz/thd-logo.jpg")

ABOUT PROJECT TYCHO

For this project we are using Project Tycho Level 2 Data Version 1.1.0 which contains 50 diseases from all 50 states in the USA (which is approximately 1284 US cities) from the year 1888 to 2014.[0] Project Tycho is global health data platform which collects public health data from different health agencies and researchers and use the data to improve health informatics. Its mainly focus is on infectious diseases and epidemics.

  1. INTRODUCTION

Measles is a highly contagious, serious disease caused by a virus of the paramyxovirus family[1]. The disease has an incubation period of 8 - 12 days, and then presents with increasing fever (up to 39 - 40.5 degrees celsius), cough, coryza and conjunctivitis. The symptoms intensify over 2 - 4 days before the onset of rash and peak on the day one of rash. The rash usually appears first on the face and neck as discrete erythematous patches. The number increases for 2 - 3 days, mostly on the trunk and face. Discrete lesions are also usually found on the distal extremities. The rash lasts for 3- 7 days before disappearing[2].

Since the introduction of measles vaccine in 1963, CDC (Center for Disease Control) of USA began its effort to fight this illness. By the year 2000 the efforts paid off and US declared measles eliminated.[0]

Even though a safe and cost effective vaccine is available, in 2021, there were an estimated 128,000 deaths globally, mostly among non-vaccinated children under the age of 5. In that same year, only about 81% of children received one dose of the measles vaccine by their first birthday, the lowest since 2008[3]. In the year 2022, in Columbus, Ohio, there was a Measles outbreak fueled by increasing vaccine hesitancy owing to increased anti-vaccine sentiment sparked by conspiracy theories surrounding the COVID-19 pandemic[4].

Our analysis is aimed at presenting cases before and after the introduction of the Measles Containing Vaccine (MCV) in the hopes that it will dispel the reservations of parents who might be hesitant about allowing their children to receive the vaccine.

  1. PROBLEM STATEMENT

Given that 2021 was the year with the lowest immunization rate for Measles in children globally since 2008, and that 2022 saw an outbreak of measles in the United States, owing to increased vaccine-skepticism, we present the incidence of Measles cases over the years in an attempt to let the data speak for itself to vaccine skeptics on the efficacy and necessity of measles vaccination.

  1. OBJECTIVES
  1. METHODS

We made use of the following packages for comprehensive analysis:

4.1. ANALYSIS

4.4.1 Data Loading and Data Clean-up

We load our data initially with an absolute path

dataTycho<-read.csv("C:/Users/hp/Downloads/data viz/ProjectTycho_Level2_v1.1.0.csv")

We get a look at the data inside our project Tycho dataframe

dataTycho

To clean up the data, we install the tidyverse package. We’ve commented the code out because in R for Data Science, we should avoid leaving in code that might change the settings on someone else’s computer if we mean to send our project to someone else.

#install.packages("tidyverse")

Here we load the tidyverse library

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We filter the Measles data we need from the Tycho dataset using the filter() function of the tidy verse library and assign it to the dataframe “usMeasles”

usMeasles <- filter(dataTycho, disease == "MEASLES")

We take a peek at our usMeasles dataframe

usMeasles

We can see that the from_data and to_date variables are type character but we need them to be type date. To do this, we load the lubridate library

library(lubridate)

We use the ymd() function from lubridate in conjunction with the mutate() function from tidyverse to achieve our conversion of the from_date and to_date variables to data type date.

usMeasles <- mutate(usMeasles, from_date = ymd(from_date),
                    to_date = ymd(to_date))

We look into our usMeasles data frame again to check if from_date and to_date have been successfully converted. Their data type is now date.

usMeasles

We arrange the data by date using the arrange() function from tidyverse

usMeaslesArranged <- arrange(usMeasles, from_date)

We take a look at our data and ascertain that our data is now correctly arranged

usMeaslesArranged

We want to what US measles incidence per state is, so we need to group our data by state and get the total number of cases in each state. To do this, we use the group_by() and summarize() functions provided by tidyverse library and save our new data into the usMeaslesPerState dataframe

usMeaslesPerState <- usMeaslesArranged %>%
                     group_by(state) %>%
                     summarize(case_per_state = sum(number, na.rm = TRUE))

We checkout the usMeaslesPerState dataframe

usMeaslesPerState

To visualize our data, we require the usmap package

#install.packages("usmap")

We load our the usmap package

library(usmap)
## Warning: package 'usmap' was built under R version 4.3.2

The usmap package requires to variables to work. The first is fips, a representation of geographical locations in the United States with unique codes. The second variable is values, which in this case is the number of cases per state store in the case_per_state variable. We use the fips() function to convert our state into fips codes, then convert the fips codes into numeric data type and save it in a variable fips. We convert our case_per_state into numeric and save it in values.

usMeaslesPerState <- usMeaslesPerState %>%
                     mutate(fips = as.numeric(fips(state)),
                            values = as.numeric(case_per_state))

We check our update usMeaslesPerState data frame and can now see the additional variables fips and values

usMeaslesPerState

We want to remove any NA cells in the fips variable of our dataframe. We use the filter() function with the !is.na() function.

usMeaslesPerState <- usMeaslesPerState %>%
                    filter(!is.na(fips))

We check again to see that the NA values have been removed.

usMeaslesPerState

We can now call the plot_usmap() function on our usMeaslesPerState dataframe. What jumps out is that the states of New York, Pennsylvania, California and Texas have the highest incidence rates

plot_usmap(data = usMeaslesPerState) +
  scale_fill_continuous(low = "#ebf059", high = "#f05959", name = "case_per_state", label =   scales::comma) +
  labs(title = "Number of Cases Per State") +
  theme(legend.position = "right")

We can visualize this data better by using a bar chart. For that we only need our state and case_per_state variables

usMeaslesPerState <- usMeaslesPerState %>%
                     select(state, case_per_state)

We check to see if our dataframe contains only the state and case_per_state variables

usMeaslesPerState

4.1.2 Data Analysis of our DataSet

We can now plot a bar chart on our dataframe. As stated earlier, we can see that New York, Pennsylvania and California have the highest peaks. The common factor amongst all these states is that they have very high populations. so the high incidence of measles cases there is not surprising giving their high populations

ggplot(data = usMeaslesPerState, mapping = aes(x = state, y = case_per_state, fill = state)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Number of Cases Per State")

Now we proceed to our mission of showcasing the incidence of measles over time. We group our usMeaslesArranged dataframe by year and summarize by total number of cases in each year and save it into a new dataframe

usMeaslesPerYear <- usMeaslesArranged %>%
                    group_by(year = year(from_date)) %>%
                    summarize(case_per_year = sum(number))

We checkout the usMeaslesPerYear dataframe

usMeaslesPerYear

We also checkout where our usMeaslesPerYear dataframe ends. The last entry is 2001

tail(usMeaslesPerYear, 10)

We use a bar plot to visualize our data. We can see that the years 1928 to 1966 had very high incidence rates, and then 1966 and onwards saw a sharp decline in the number of measles cases. It should be noted that the measles vaccine was first introduced in 1963 by John Enders but then a better version was introduced in 1968 by Maurice Hillman[5]. In our barplot, we observe a steep fall in the number of measles cases after the introduction of the Maurice Vaccine.

ggplot(usMeaslesPerYear, mapping = aes(x = year, y = case_per_year, fill = year)) +
  geom_bar(stat = "identity") +
  labs(title = "Measles Cases Per Year")

We take a closer look at the 1980s to 2000s which are hard to see on the barplot

usMeasles80s <- usMeaslesPerYear %>%
                filter(year >= 1981 & year <= 2001)

We use filter() function to get data from the year 1981 to 2001, save it in the usMeasles80s dataframe and take a look at it

usMeasles80s

We use ggplot to visualize the data and save it into the usMeasles80sPlot dataframe

usMeasles80sPlot <- ggplot(usMeasles80s, aes(x = year, y = case_per_year, fill = year)) +
  geom_area() +
  geom_line() +
    labs(title = "Measles Cases Per Year 1981-2000")

We checkout the usMeasles80sPlot and can see that there was a slight peak in cases in the late 80s

usMeasles80sPlot

4.1.3 Data Loading and Clean-Up of WHO vaccination data

We got some data on measles vaccinations from the WHO website in excel format. To open it, we install the readxl package

#install.packages("readxl")

We load our readxl package

library(readxl)

We load our excel data into the measlesVacc dataframe initally using the absolute path.

measlesVacc <- read_excel("C:/Users/hp/Downloads/data viz/measles_vaccination1.xlsx")
## New names:
## • `` -> `...68`

We convert our excel data into csv data

write.csv(measlesVacc, "measles_vaccination1.csv")

Note That: We manually cleaned up the measles_vaccination1.csv file with our text editor, removed unwanted meta data and renamed the file measles_vaccination3.csv

measlesVacc2 <- read.csv("measles_vaccination3.csv")

We read the csv data and save it into the measlesVacc2 data frame

measlesVacc2

Our data contains vaccination data from many countries but we are only concerned with the United States so we filter out the data we need and save it in usVaccData

library(dplyr)
 usVaccData <- filter(measlesVacc2, CountryCode == "USA")

We check our usVaccData

usVaccData

We check the class of the variables in usVaccData

sapply(usVaccData, class)
##    CountryName    CountryCode Indicator.Name Indicator.Code          X1960 
##    "character"    "character"    "character"    "character"      "logical" 
##          X1961          X1962          X1963          X1964          X1965 
##      "logical"      "logical"      "logical"      "logical"      "logical" 
##          X1966          X1967          X1968          X1969          X1970 
##      "logical"      "logical"      "logical"      "logical"      "logical" 
##          X1971          X1972          X1973          X1974          X1975 
##      "logical"      "logical"      "logical"      "logical"      "logical" 
##          X1976          X1977          X1978          X1979          X1980 
##      "logical"      "logical"      "logical"      "logical"      "numeric" 
##          X1981          X1982          X1983          X1984          X1985 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X1986          X1987          X1988          X1989          X1990 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X1991          X1992          X1993          X1994          X1995 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X1996          X1997          X1998          X1999          X2000 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X2001          X2002          X2003          X2004          X2005 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X2006          X2007          X2008          X2009          X2010 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X2011          X2012          X2013          X2014          X2015 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X2016          X2017          X2018          X2019          X2020 
##      "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
##          X2021          X2022 
##      "numeric"      "logical"

The early years in usVaccData have a lot of NA values so we select only data from 1981 onwards

usVaccDataFinal <- select(usVaccData, -(X1960:X1980))

We checkout usVaccDataFinal

usVaccDataFinal

Now we have immunization data from 1981 to 2001. However, the data is not entered a usable format for us. To fix this, we use tribble() function to create a new dataframe with the same information in usVaccDataFinal but in a format that is more amenable to use.

usMeaslesVaccRate <- tribble(~years, ~immunization_rate,
        1981, 97,
        1982, 97,
        1983, 98,
        1984, 98,
        1985, 97,
        1986, 97,
        1987, 82,
        1988, 98,
        1989, 94,
        1990, 90,
        1991, 87,
        1992, 83,
        1993, 84,
        1994, 89,
        1995, 88,
        1996, 91,
        1997, 91,
        1998, 92,
        1999, 92,
        2000, 91,
        2001, 91)

We now have a new dataframe called usMeaslesVaccRate with two variables, years and immunization_rate

usMeaslesVaccRate

We use a ggplot to visualize the data and save it in usMeaslesVaccRatePlot

usMeaslesVaccRatePlot <- ggplot(usMeaslesVaccRate, aes(x = years, y = immunization_rate, fill = years)) +
  geom_area() +
  geom_line() +
      labs(title = "Measles Immunization Rate from Years 1981-2001")

4.1.4 Data Analysis of our Measles Vaccination Dataset

We chekout what the usMeaslesVaccRatePlot looks like. We can see that there is a small decline in the vaccination rate in the late 80s

usMeaslesVaccRatePlot

4.1.5 Comparison of Measles Incidence Plot and Measles Vaccination Rate Plot

We load the patchwork libray so we can get the usMeasles80sPlot and usMeaslesVaccRatePlot side by side

library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.2

We compare both plots and can see that the sligh peak in masles cases in the late 80s overlaps with the slight decrease in vaccination rates in the late 80s

usMeasles80sPlotAndusMeaslesVaccRatePlot <- usMeasles80sPlot + usMeaslesVaccRatePlot
usMeasles80sPlotAndusMeaslesVaccRatePlot + plot_layout(heights = c(3,1))

4.1.6. Statistical Analysis of Measles Incidence Rate with Vaccination Rate

First, we section some data that we need. We get the incidence of measles cases from 1931 to 2001. However, since we know our vaccination data lacks entries for the early periods of vaccination, ie, from 1963 to 1979, we have to remove those years from our incidence data too.

usMeasles30s <- usMeaslesPerYear %>%
                filter(year >= 1931 & year <= 2001 & !(year >= 1963 & year <= 1979))

We know that before the introduction of the measles vaccine to the public in 1963, and again in 1968, there were no children vaccinated against measles. That means the vaccination rate against measles before 1963 is 0%.

usMeaslesVaccRate30s <- tribble(~years, ~immunization_rate,
        1931, 0,
        1932, 0,
        1933, 0,
        1934, 0,
        1935, 0,
        1936, 0,
        1937, 0,
        1938, 0,
        1939, 0,
        1940, 0,
        1941, 0,
        1942, 0,
        1943, 0,
        1944, 0,
        1945, 0,
        1946, 0,
        1947, 0,
        1948, 0, 
        1949, 0,
        1950, 0,
        1951, 0,
        1952, 0,
        1953, 0,
        1954, 0,
        1955, 0,
        1956, 0,
        1957, 0,
        1958, 0,
        1959, 0,
        1960, 0,
        1961, 0,
        1962, 0,
        1963, NA,
        1964, NA,
        1965, NA,
        1966, NA,
        1967, NA,
        1968, NA,
        1969, NA,
        1970, NA,
        1971, NA,
        1972, NA,
        1973, NA,
        1974, NA,
        1975, NA,
        1976, NA,
        1977, NA,
        1978, NA,
        1979, NA,
        1980, 97,
        1981, 97,
        1982, 97,
        1983, 98,
        1984, 98,
        1985, 97,
        1986, 97,
        1987, 82,
        1988, 98,
        1989, 94,
        1990, 90,
        1991, 87,
        1992, 83,
        1993, 84,
        1994, 89,
        1995, 88,
        1996, 91,
        1997, 91,
        1998, 92,
        1999, 92,
        2000, 91,
        2001, 91)

We’ve assinged 0% as the immunization rate for years before 1963, and NA for the years 1963 to 1979 because even though vaccinations were available in those years, we do not have the data. But to proceed, we must remove the NA values from our data.

usMeaslesVaccRate30s <- usMeaslesVaccRate30s %>%
                        filter(!is.na(immunization_rate))

We check out usMeasles30s

usMeasles30s

We check out usMeaslesVaccRate30s

usMeaslesVaccRate30s

We do a correlation test using Pearson’s product moment test. The value of our correlation coefficient is -0.8 which indiates that there is a negative correlation between the incidence of measles cases and immunization rate. Our p value also indicates that the correlation is statistically significant.

cor.test(usMeasles30s$case_per_year, usMeaslesVaccRate30s$immunization_rate)
## 
##  Pearson's product-moment correlation
## 
## data:  usMeasles30s$case_per_year and usMeaslesVaccRate30s$immunization_rate
## t = -12.273, df = 52, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9180150 -0.7728455
## sample estimates:
##        cor 
## -0.8621872

4.1.7 Modeling

We want to model our y dependent variable (case_per_year) on our independent variable x variable (immunization_rate). for this, we create a new dataframe called modelData

modelData <- usMeasles30s %>%
              select(y = case_per_year) %>%
              mutate(x = usMeaslesVaccRate30s$immunization_rate)

We fit a linear regression on our modelData

modelDataMod <- lm(y ~ x, data = modelData)

We find the intercept and coefficient

coef(modelDataMod)
## (Intercept)           x 
##  585953.839   -6326.999

To make predictions, we need to load the modelr pacakge

library(modelr)
## Warning: package 'modelr' was built under R version 4.3.1

We use the data_grid() function to generate an evenly spaced grid of values that covers the region where out data lies

grid <- modelData %>%
        data_grid(x)
grid

Next we use the add_predictions function to add predictions

grid <- grid %>%
        add_predictions(modelDataMod)
grid

We plot our predictions using ggplot2

ggplot(modelData, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, color = "red", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We add residuals to tell us what our model has missed

modelData <- modelData %>%
             add_residuals(modelDataMod)
modelData

We visualize our residuals with a frequency polygon

ggplot(modelData, aes(resid)) +
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. RESULTS

    Our data shows that the introduction of the Maurice Hillman vaccine in 1968 caused a sharp decline in the incidence of measles in the United States. Our data also shows that a slight decrease in the immunization rate in the late 80s coincided with a slight increase in measles cases over the same time period.

  2. DISCUSSION

    We had to get a bit creative with our statistical analysis in 4.1.6.However, assigning 0% as the vaccination rates for the years before the licensing of any measles vaccine makes sense. And when we do that, we can see that there is a strong negative correlation between measles incidence and vaccination rates.

  3. CONCLUSION

    Our aim is to present the data to vaccination skeptics and show them how the discovery and introduction of the measles vaccine caused a sharp decrease in the incidence of measles in the late 60s. Millions of lives have been saved by the measles vaccine since its discovery. If measles vaccination rates continue to fall in the aftermath of the covid-19 pandemic, it will have real consequences for public health.

  4. CITATIONS